Introduction

The fast Walsh Hadamard transform (WHT) is a little-known yet useful
algorithm for machine learning and neural networks. It is easier to understand and
use than the fast Fourier transform, and its computational requirements are
especially low. Indeed, it was popular in the 1960s and early 1970s
for that exact reason.


How about using it to improve the speed and power efficiency of modern 
machine learning and neural network algorithms? 


This special report gives a relatively non-technical introduction, together with
many example applications: from classical image compression, to
fair dimensional reduction and increase, locality sensitive hashing, associative
memory and super-fast neural networks.


Sean O’Connor 10 May 2021 


Chapter 1. The 2, 4 and 8-point Walsh Hadamard transform. 
Simply adding and subtracting a sequence of 2 numbers a and b will give you the 
2-point Walsh Hadamard Transform (WHT) of the pair. 


WHT(a,b) = a+b, a-b 


The operation is self-inverse if you allow for a scaling factor. 
WHT(WHT(a,b)) = WHT(a+b,a-b) = (a+b)+(a-b), (a+b)-(a-b) = 2a, 2b 


The scaling factor needed after applying the 2-point WHT twice is 1/2.


The scaling operation can be distributed between the 2 WHT calculations giving a
scaling factor of 1/√2 per 2-point WHT. You might call this the scaled or
normalized Walsh Hadamard Transform, WHT_N.


WHT_N(a,b) = (a+b)/√2, (a-b)/√2


WHT_N(WHT_N(a,b)) = a, b


WHT Example Calculations: 
WHT(0,0) = 0, 0 
WHT(1,0) = 1, 1 
WHT(0,1) = 1, -1 
WHT(1,1) = 2, 0 
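
The 2-point calculations above can be sketched in a few lines of JavaScript. This is only an illustration; the function names wht2 and wht2Normalized are assumptions made here, not names used elsewhere in this report.

```javascript
// Plain 2-point WHT: the sum and the difference of the pair.
function wht2(a, b) {
  return [a + b, a - b];
}

// Normalized 2-point WHT: each output is scaled by 1/sqrt(2),
// so applying it twice returns the original pair.
function wht2Normalized(a, b) {
  const s = Math.SQRT1_2; // 1/sqrt(2)
  return [(a + b) * s, (a - b) * s];
}

console.log(wht2(1, 0)); // [ 1, 1 ]
console.log(wht2(0, 1)); // [ 1, -1 ]
console.log(wht2(...wht2(3, 5))); // [ 6, 10 ], that is 2a and 2b
```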


To obtain the 4-point WHT of the sequence of numbers a,b,c,d you can first 
calculate WHT(a,b) and WHT(c,d). That results in 2 sum terms, a+b from the first 
WHT and c+d from the second WHT and the 2 difference terms a-b and c-d. 


You then form the sum and difference of the 2 sum terms and the sum and 
difference of the 2 difference terms to calculate the full 4-point transform. 


WHT(a,b,c,d) = (a+b)+(c+d), (a+b)-(c+d), (a-b)+(c-d), (a-b)-(c-d)
WHT(a,b,c,d) = a+b+c+d, a+b-c-d, a-b+c-d, a-b-c+d


Since each of the individual steps is invertible the entire transform is invertible 
and you can see interesting patterns of addition and subtraction starting to 
emerge. 


The 8-point WHT of the sequence a,b,c,d,e,f,g,h can be obtained from the 4-point
transforms WHT(a,b,c,d) and WHT(e,f,g,h). You then add and subtract alike terms,
that is, terms with the same pattern of + and - signs.


WHT(a,b,c,d) = a+b+c+d, a+b-c-d, a-b+c-d, a-b-c+d
WHT(e,f,g,h) = e+f+g+h, e+f-g-h, e-f+g-h, e-f-g+h
Add and subtract (a+b+c+d) and (e+f+g+h).

Add and subtract (a+b-c-d) and (e+f-g-h). 

Add and subtract (a-b+c-d) and (e-f+g-h). 

Add and subtract (a-b-c+d) and (e-f-g+h). 


Chapter 2. The matrix perspective. 
There is a natural ordering for the results of the transform, which many algorithms produce.
So far, the results have not been presented in any particular order.


For the 4-point transform the natural order is: 
a+b+c+d

a-b+c-d

a+b-c-d

a-b-c+d


Which can be put in matrix form:


+1 +1 +1 +1 
+1 -1 +1 -1 
+1 +1 -1 -1 
+1 -1 -1 +1 


Figure 1. 
The product of the top row with the column (a,b,c,d) is a+b+c+d.
The product of the next row with the column (a,b,c,d) is a-b+c-d. 
and so on. 


The matrix form for the 8-point transform in natural order is: 
Figure 2.


An important point you can immediately notice is that the number of calculations
needed for the WHT is far smaller than for the equivalent matrix multiply.


For the 4-point transform the total number of add and subtract operations
needed is 8; the matrix equivalent needs 16 multiply and 12 add operations.

For the 8-point transform the total number of add and subtract operations
needed is 24; the matrix equivalent needs 64 multiply and 56 add operations.

You can also notice that in the natural ordering the matrices are symmetric. The 
first row is the same as the first column, the second row is the same as the second 
column in the square matrix, etc. 


Since the matrix is symmetric and its rows are orthogonal to each other, it is
self-inverse apart from a scaling factor. There is no need to devise
a special algorithm to unwind, step by step, a prior transformation. You just
reapply the same transform:
Figure 3.
Scaling by the appropriate factor of 1/4 gives back the original sequence 2,3,4,5.


Chapter 3. The H Matrix Form. 
The Walsh Hadamard transform in matrix form (H) can be defined using the 
following constructive scheme: 
Figure 4.


From this you can conclude the H matrix is a square matrix of order 2^n where n is an
integer. The matrix form also defines the natural ordering of the transform. The
fact that it is self-inverse (HH=I, once the scaling factor is included) means the
transform does not destroy information that passes through it and it leaves vector
length unchanged aside from a constant scaling factor.
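
As an illustration, the sketch below builds H by repeated doubling. It assumes the constructive scheme is the standard one, starting from the 1x1 matrix [1] and replacing H with the block matrix [H H; H -H] at each step, which reproduces the 4-point matrix of Figure 1. The function names hadamardMatrix and multiply are hypothetical.

```javascript
// Build the Walsh Hadamard matrix of the given size by repeated doubling.
// Assumption: size is an integer power of 2.
function hadamardMatrix(size) {
  let h = [[1]];
  while (h.length < size) {
    const top = h.map(row => [...row, ...row]);                 // [ H  H ]
    const bottom = h.map(row => [...row, ...row.map(v => -v)]); // [ H -H ]
    h = [...top, ...bottom];
  }
  return h;
}

// Multiply two square matrices of the same size.
function multiply(a, b) {
  return a.map(row =>
    b[0].map((_, col) => row.reduce((sum, v, k) => sum + v * b[k][col], 0))
  );
}

const h4 = hadamardMatrix(4);
console.log(h4);               // the matrix of Figure 1
console.log(multiply(h4, h4)); // 4 on the diagonal, 0 elsewhere
```

Multiplying the matrix by itself gives the identity matrix times the matrix order, which is the constant scaling factor mentioned above.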








Figure 5. 


In figure 5, x_u is the upper half of a column vector and x_l is the lower half. When
the square matrix and the column vector are multiplied you notice that x_u
associates with the left-hand side of the square matrix and x_l associates with the
right-hand side. In the upper half of the square matrix an addition of alike terms
occurs and in the lower half of the square matrix a subtraction of alike terms
occurs, as was seen in the original unordered constructions of the 2-point, 4-point
and 8-point WHTs.


Chapter 4. The WHT basis vectors from square waves. 

You can generate square waves by ANDing a binary integer counter with specific 
integer powers of 2 and checking if the result is zero or not, or checking if the 
parity is odd or even. 


The parity is given by whether the number of bits set in some binary integer is 
even or odd, and is calculated by XORing all the bits of the binary integer 
together. 


For a parity-odd function p(x), that returns 1 when an odd number of bits in x are
set (and zero otherwise), and a counter c running from 0 to, say, 7 you can
generate 4 square waves, including the straight line, from the equation:


x=1-2*p( c AND r ), r=0,1,2,4 


Figure 6. Square waves from the function x=1-2*p(c AND r), where c is a binary counter and r =
0,1,2,4.


In binary form 0=000₂, 1=001₂, 2=010₂, 4=100₂


The point of using the parity function to generate the square waves is that it
provides an easy way to multiply waves together (by XORing them) in various
combinations. For example, setting r=6 (110₂) gives a waveform that is the
product of the square waves r=4 and r=2.


Figure 7. Waveform for r=6 (vertical axis from -1 to +1).


All the various different products of the square waves can be calculated by 
incrementing through the meaningful values of r. That results in the same 
patterns of +1, -1 in the same natural order as seen in the basis vectors of the 
Walsh Hadamard matrix H. 


Figure 8. Waveforms for r=0 to r=7 (each plotted on a vertical axis from -1 to +1).


An important point is that all the waveforms are orthogonal to each other. The 
dot product of any 2 different waveforms gives 0. That is because they all differ by 
at least one square wave from each other. 
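
The waveform equation x=1-2*p(c AND r) can be tried out directly. The sketch below is a minimal illustration with hypothetical function names (parityOdd, squareWave, dot); it generates the waveforms for an 8-step counter and checks the product and orthogonality properties described above.

```javascript
// Parity-odd function: 1 when an odd number of bits in x are set, else 0.
function parityOdd(x) {
  let p = 0;
  while (x !== 0) {
    p ^= x & 1;
    x >>>= 1;
  }
  return p;
}

// Waveform for mask r, with the counter c running from 0 to n-1.
function squareWave(r, n) {
  const wave = [];
  for (let c = 0; c < n; c++) {
    wave.push(1 - 2 * parityOdd(c & r));
  }
  return wave;
}

// Dot product of two equal-length vectors.
function dot(a, b) {
  return a.reduce((sum, ai, i) => sum + ai * b[i], 0);
}

console.log(squareWave(6, 8)); // product of the r=4 and r=2 waves
console.log(dot(squareWave(3, 8), squareWave(5, 8))); // 0
```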


Chapter 5. Sequency. 
Sequency is the WHT equivalent of frequency. It is a count of the number of 
transitions between +1 and -1 or -1 and +1 in the basis vectors of the WHT. 


In figure 8: 

r=0 gives sequency 0
r=1 gives sequency 7 
r=2 gives sequency 3 
r=3 gives sequency 4 
r=4 gives sequency 1 
r=5 gives sequency 6 
r=6 gives sequency 2 
r=7 gives sequency 5 


To convert from the natural order position to sequency you reverse the bits of r 
and then calculate the inverse Gray code (x XOR (x>>1) XOR (x>>2)... until the 
shifted value is zero): 


r=000₂  reverse(r)=000₂  igc(000₂)=000₂ = 0
r=001₂  reverse(r)=100₂  igc(100₂)=111₂ = 7
r=010₂  reverse(r)=010₂  igc(010₂)=011₂ = 3
r=011₂  reverse(r)=110₂  igc(110₂)=100₂ = 4
r=100₂  reverse(r)=001₂  igc(001₂)=001₂ = 1
r=101₂  reverse(r)=101₂  igc(101₂)=110₂ = 6
r=110₂  reverse(r)=011₂  igc(011₂)=010₂ = 2
r=111₂  reverse(r)=111₂  igc(111₂)=101₂ = 5


To convert sequency to natural order position, calculate the Gray code (x XOR
(x>>1)) and then reverse the bits.
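
The two conversions can be sketched as below. The function names and the explicit bits parameter (the number of bits in the index, 3 for the 8-point transform) are assumptions made for this illustration.

```javascript
// Reverse the lowest `bits` bits of x.
function reverseBits(x, bits) {
  let r = 0;
  for (let i = 0; i < bits; i++) {
    r = (r << 1) | ((x >>> i) & 1);
  }
  return r;
}

// Gray code: x XOR (x >> 1).
function grayCode(x) {
  return x ^ (x >>> 1);
}

// Inverse Gray code: x XOR (x>>1) XOR (x>>2) ... until the shifted value is zero.
function inverseGrayCode(x) {
  let result = x;
  for (let s = x >>> 1; s !== 0; s >>>= 1) {
    result ^= s;
  }
  return result;
}

function naturalToSequency(r, bits) {
  return inverseGrayCode(reverseBits(r, bits));
}

function sequencyToNatural(s, bits) {
  return reverseBits(grayCode(s), bits);
}

// Sequency of each natural order position for the 8-point transform.
console.log([0, 1, 2, 3, 4, 5, 6, 7].map(r => naturalToSequency(r, 3)));
// [ 0, 7, 3, 4, 1, 6, 2, 5 ]
```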


Chapter 6. The Out-Of-Place Walsh Hadamard transform. 

Given an array of data that is some integer power of 2 in length, you can step
through the array data pairwise. The sum of each pair of array elements is placed
sequentially in the lower half of a temporary array (the same size as the data
array), and the difference of the pair of elements is placed sequentially in the
upper half of the temporary array. Alike terms end up being grouped sequentially
in the temporary array. You then swap the data array and temporary array and
repeat the process, stopping when there are no more alike terms to add and
subtract, that is to say when all the output terms are unique patterns of addition
and subtraction. That happens after log2(n) stages where n is the length of the
original data array.


Example:
a    a+b    a+b+c+d
b    c+d    a-b+c-d
c    a-b    a+b-c-d
d    c-d    a-b-c+d
Figure 9.


JavaScript Code: See Appendix A. 
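
Separately from the appendix listing, the steps described in this chapter can be sketched as follows. This is a minimal illustration rather than an optimized implementation; the function name whtOutOfPlace is an assumption, and the input length is assumed to be an integer power of 2.

```javascript
// Out-of-place WHT (unscaled). Returns a new array in natural order.
// Assumption: input.length is an integer power of 2.
function whtOutOfPlace(input) {
  const n = input.length;
  const half = n / 2;
  let data = Array.from(input); // working copy, the input is left untouched
  let temp = new Array(n);
  // One pass per stage, log2(n) stages in total.
  for (let span = 1; span < n; span *= 2) {
    for (let i = 0; i < half; i++) {
      const a = data[2 * i];
      const b = data[2 * i + 1];
      temp[i] = a + b;        // sums go sequentially into the lower half
      temp[i + half] = a - b; // differences go sequentially into the upper half
    }
    [data, temp] = [temp, data]; // swap the data and temporary arrays
  }
  return data;
}

console.log(whtOutOfPlace([2, 3, 4, 5])); // [ 14, -2, -4, 0 ]
```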


Chapter 7. The in-place algorithm. 

The in-place algorithm requires you to keep track of where alike terms are, noting
that alike terms that have been added and subtracted and put back in place are
no longer alike. The indexing requirements then involve pairwise blocks of data,
with the block size doubling at each stage of the calculation.


a (1)   a+b (1)   a+b+c+d (1)   a+b+c+d+e+f+g+h (1)
b (1)   a-b (2)   a-b+c-d (2)   a-b+c-d+e-f+g-h (2)
c (1)   c+d (1)   a+b-c-d (3)   a+b-c-d+e+f-g-h (3)
d (1)   c-d (2)   a-b-c+d (4)   a-b-c+d+e-f-g+h (4)
e (1)   e+f (1)   e+f+g+h (1)   a+b+c+d-e-f-g-h (5)
f (1)   e-f (2)   e-f+g-h (2)   a-b+c-d-e+f-g+h (6)
g (1)   g+h (1)   e+f-g-h (3)   a+b-c-d-e-f+g+h (7)
h (1)   g-h (2)   e-f-g+h (4)   a-b-c+d-e+f+g-h (8)


Figure 10. The in-place 8-point WHT. Alike terms have the same number in brackets. In the 
first stage the block size is 1, the second stage 2, the third stage 4. You go through the blocks 
pairwise, adding and subtracting alike terms. 


In figure 10 you start with block sizes of 1, a is in one block of size 1, b is in one
block of size 1. You form the sum and difference and put the results back in the
array. Then you move on to c and d etc.


In the second stage a+b and a-b are in one block of size 2, and c+d and c-d are in
the next block of size 2. You add and subtract the alike terms a+b and c+d from
the 2 blocks and put the results back where they originated, then you add and
subtract the alike terms a-b and c-d from the 2 blocks and put the results back
where they originated, then move on to the next pair of blocks. The algorithm
completes after log2(n) stages where n is the input size (an integer power of 2.)


JavaScript code: See Appendix B. 
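
Separately from the appendix listing, the block-pair indexing described in this chapter can be sketched as follows. The function name whtInPlace and the variable name blockSize are assumptions made for this illustration, and the input length is assumed to be an integer power of 2.

```javascript
// In-place WHT (unscaled). Overwrites vec with its transform in natural order.
// Assumption: vec.length is an integer power of 2.
function whtInPlace(vec) {
  const n = vec.length;
  // The block size doubles at each stage: 1, 2, 4, ...
  for (let blockSize = 1; blockSize < n; blockSize *= 2) {
    // Step through the blocks pairwise.
    for (let start = 0; start < n; start += 2 * blockSize) {
      // Alike terms sit at the same offset in the two blocks of a pair.
      for (let j = start; j < start + blockSize; j++) {
        const a = vec[j];
        const b = vec[j + blockSize];
        vec[j] = a + b;
        vec[j + blockSize] = a - b;
      }
    }
  }
  return vec;
}

console.log(whtInPlace([2, 3, 4, 5])); // [ 14, -2, -4, 0 ]
```

Applying the function twice and dividing every element by the array length recovers the original data.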


Chapter 8. Computational aspects. 

The computational cost of the in-place and out-of-place WHT algorithms is very
low, requiring only n*log2(n) add and subtract operations where n is the size of the
input vector. Viewing the WHT as a system of n fixed dot products with an input
vector, the cost per dot product is a very low log2(n) operations.


n      log2(n)   n*log2(n)
1      0         0
2      1         2
4      2         8
8      3         24
16     4         64
32     5         160
64     6         384
128    7         896
256    8         2048


Figure 11. Cost factors for some small sized WHTs.


Generally the in-place WHT algorithm is the fastest as it places the least stress on
the CPU (or GPU) cache memory system. It is possible to mix together the in-place
and out-of-place algorithms and to process the data in small batches to
optimize the speed of the transform.


Typically you can expect to calculate 1000 65536-point WHTs per second per CPU
core, using ordinary C or Java code. If you use the special SIMD CPU instructions
that can increase to 5000 65536-point WHTs per second per CPU core.


There is a FFHT (Fast Fast Hadamard Transform) library available on the internet. 


An input vector to a WHT with a single non-zero element produces a sequency 
pattern at the output. That sequency pattern maps back to the original point 
when a second transform is applied. 


Because the transform is linear, changing a single input element to the WHT
changes all the output elements in a sequency pattern, regardless of anything else.


The one-affects-all property means the WHT is the fastest way to fully mix
together highly dimensional data on digital systems. Only algorithms that copy its
structure are equally fast.


The highly structured sequency patterns of the WHT correlate well with natural 
data such as images, allowing the algorithm to be used for compression. However 
the transform is not bandwidth limited (because square waves are not) unlike the 
FFT which uses sine and cosine waves as basis functions. 


The vector u = WHT_N(x) + x is left unchanged by a further WHT_N (scaled or
normalized WHT) operation. The equation acts as a filter.


WHT_N(WHT_N(x) + x) = x + WHT_N(x)

Therefore: WHT_N(u) = u

Likewise the vector v = WHT_N(x) - x is left unchanged except for a 180 degree flip.
WHT_N(WHT_N(x) - x) = x + WHT_N(-x) = x - WHT_N(x) = -(WHT_N(x) - x)

Therefore: WHT_N(v) = -v


Since u and v are basically left unchanged they are eigenvectors of WHT_N and the
eigenvalues are +1 and -1. An arbitrary vector x can be split into u and v and then
reconstructed (2x = u-v.)
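
The split can be checked numerically. The sketch below is an illustration only; whtNormalized and splitIntoEigenvectors are hypothetical names, and the normalized transform is assumed to be the plain WHT scaled by 1/√n.

```javascript
// Normalized WHT (scaled by 1/sqrt(n)). Returns a new array.
// Assumption: input.length is an integer power of 2.
function whtNormalized(input) {
  const vec = Array.from(input);
  const n = vec.length;
  for (let h = 1; h < n; h *= 2) {
    for (let i = 0; i < n; i += 2 * h) {
      for (let j = i; j < i + h; j++) {
        const a = vec[j];
        const b = vec[j + h];
        vec[j] = a + b;
        vec[j + h] = a - b;
      }
    }
  }
  const scale = 1 / Math.sqrt(n);
  return vec.map(value => value * scale);
}

// Split x into u = WHT_N(x) + x and v = WHT_N(x) - x.
function splitIntoEigenvectors(x) {
  const wx = whtNormalized(x);
  return {
    u: x.map((xi, i) => wx[i] + xi),
    v: x.map((xi, i) => wx[i] - xi),
  };
}

const parts = splitIntoEigenvectors([2, 3, 4, 5]);
console.log(parts.u, whtNormalized(parts.u)); // the same vector twice
console.log(parts.v, whtNormalized(parts.v)); // the second is the first negated
```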


Chapter 9. Review of the dot product. 

The dot product of two vectors a=[a_1, a_2, ..., a_n] and b=[b_1, b_2, ..., b_n] is:
a·b = Σ a_i b_i for i=1 to n (= a_1 b_1 + a_2 b_2 + ... + a_n b_n)

The geometric definition is:


a·b = ||a|| ||b|| cos(θ) where ||a|| and ||b|| are the magnitudes of a and b and θ is the
angle between them.


Then very usefully the algebraic definition can be used to calculate the angle
between 2 vectors a and b:


θ = arccos(a·b / (||a|| ||b||))
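
As a small illustration of the two definitions working together, the sketch below computes the angle between two vectors. The function names dot, magnitude and angleBetween are assumptions made here, and the clamp is an added safeguard against rounding error.

```javascript
// Algebraic dot product of two equal-length vectors.
function dot(a, b) {
  return a.reduce((sum, ai, i) => sum + ai * b[i], 0);
}

// Vector magnitude (Euclidean length).
function magnitude(a) {
  return Math.sqrt(dot(a, a));
}

// Angle between a and b in radians.
function angleBetween(a, b) {
  const cosine = dot(a, b) / (magnitude(a) * magnitude(b));
  // Clamp to [-1, 1] so rounding error cannot push acos out of range.
  return Math.acos(Math.max(-1, Math.min(1, cosine)));
}

// Two different rows of the 4-point H matrix are at 90 degrees to each other.
console.log(angleBetween([1, 1, 1, 1], [1, -1, 1, -1]) * 180 / Math.PI); // 90
```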


In relation to the Walsh Hadamard transform it is useful to know that the variance 
equation for linear combinations of random variables applies to the dot product. 


For a random variable X:


E[aX + b] = aE[X] + b


Var[aX + b] = a²Var[X]

For random variables X_1, X_2, ..., X_n with arbitrary constants c_1, c_2, ..., c_n:
E[c_1X_1 + c_2X_2 + ... + c_nX_n] = c_1E[X_1] + c_2E[X_2] + ... + c_nE[X_n]

If X_1, X_2, ..., X_n are independent:

Var[c_1X_1 + c_2X_2 + ... + c_nX_n] = c_1²Var[X_1] + c_2²Var[X_2] + ... + c_n²Var[X_n]


The final 2 equations apply to a dot product (c-X) between an ordinary vector c 
and a random vector X whose elements are independent random variables. 


For dot products of a random vector X with a vector like h=[+1,-1,-1,+1] (or similar
non-sparse vectors) the central limit theorem applies.


Note: The central limit theorem applies not only to sums but equally to sums and 
differences in terms of the Normal distribution and variance. 


Central Limit Theorem:


If X_1, X_2, ..., X_n is a sample from a distribution with mean μ and standard deviation
σ and if n is large, the sample mean of X_1, X_2, ..., X_n has approximately:


The Normal distribution (a Bell Curve distribution.)
Expected sample mean μ.
Expected sample variance σ²/n.


Expected sample standard deviation σ/√n.


As n increases the standard deviation of the sample mean falls with the square root of n.
You can calculate the sample mean with a dot product via
[1/n,1/n,...,1/n]·X.


Also it wouldn’t matter to the variance equation or standard deviation equation if
some of the elements in the first vector had their sign flipped (e.g. [1/n, -1/n,
-1/n, 1/n]); however, it does matter to the actual sample mean.


Chapter 10. The Normal distribution from the WHT. 

Because all the entries in a WHT H matrix are either +1 or -1 you can make use of 
the central limit theorem to generate a sequence of random numbers with the 
Normal distribution from a sequence of random numbers with the uniform 
distribution. 


That is r_n = WHT(r_u) where r_n is a vector of random numbers with the Normal (or
Gaussian) distribution and r_u is a vector of random numbers with the uniform
distribution.


It is best to use uniform random numbers centered at zero since the top row of 
the H matrix is all +1s and the other rows have 50% plus 1 and 50% minus 1 
entries. Otherwise, the top row and the other rows will have different means. 
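
A minimal sketch of the idea follows. It assumes the scaled transform (a factor of 1/√n) so that the output variance equals the input variance, and it uses Math.random as the uniform source; the function name normalFromUniform is hypothetical.

```javascript
// Transform a vector of zero-centered uniform random numbers into
// approximately Normal random numbers with a scaled WHT.
// Assumption: uniforms.length is an integer power of 2.
function normalFromUniform(uniforms) {
  const v = Array.from(uniforms);
  const n = v.length;
  for (let h = 1; h < n; h *= 2) {
    for (let i = 0; i < n; i += 2 * h) {
      for (let j = i; j < i + h; j++) {
        const a = v[j];
        const b = v[j + h];
        v[j] = a + b;
        v[j + h] = a - b;
      }
    }
  }
  const scale = 1 / Math.sqrt(n); // keeps the vector length unchanged
  return v.map(value => value * scale);
}

// Uniform random numbers centered at zero, in the range [-1, 1).
const uniforms = Array.from({ length: 1024 }, () => 2 * Math.random() - 1);
const normals = normalFromUniform(uniforms);
console.log(normals.slice(0, 8));
```

With inputs uniform on [-1, 1) each output has variance 1/3, so multiplying by √3 would give approximately unit variance.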


Since the rows of the WHT H matrix are orthogonal the Normal random variables 
produced are independent from each other in that regard. 


And yet there is still a connection between them - the WHT leaves vector length 
unchanged. If one Normal random variable is especially large in magnitude all the 
others must therefore be a little smaller in magnitude than you might otherwise 
expect. 


That is, there is an entanglement through vector length. 

There is a second problem you might encounter:

r_u = WHT(WHT(r_u)), apart from a scaling factor.

You can get back to the uniform distribution by applying the WHT again.


You can attempt to hide the capacity to invert by applying a random permutation 
to the generated Normal random variables or applying a random pattern of sign 
flipping to the generated Normal random variables. In adversarial settings the 
obfuscation might be discovered. 


There may also be other detail level mathematical issues, see: ‘A new method of 
generating Gaussian random variables by computer’ on archive.org. 


However, for many practical applications it provides a fast method to generate 
the Normal distribution. 


Constraints on vector length and vectors of Normal random variables are related 
to uniform point picking on the surface of a hypersphere, since to do point picking 
you simply generate a vector of Normal variables and then adjust the vector 
length to the required radius (Muller 1959.) 


Chapter 11. Data Compression. 

One of the earliest applications of the WHT was for image compression. You 
transform the image data and keep the highest magnitude components along 
with their location, discarding the rest. To decompress you just reinsert the kept 
data into a previously zeroed array and transform again. This works because 
natural world data often strongly matches the basis patterns of the WHT. 


The WHT has very strong spectral behavior on natural data, similar to the FFT.
However, the FFT is bandwidth limited because its basis functions are sine and
cosine waves. The WHT is not, and this shows as sharp-edged blocks appearing
at high compression rates.


At lower compression rates Normally distributed noise predominates due to 
random mismatching between the image data and the basis patterns of the WHT. 
Normally distributed noise is often removed from images with a median filter. 


JavaScript Code: See Appendix C. 





Figure 12. Clockwise from top left: original image, 100 dimensions, 1000 dimensions, 10000
dimensions.


In practice it would be better to use a heap algorithm to select the k-highest 
magnitude components of the WHT, rather than use a sort algorithm. 
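
The keep-the-largest scheme described in this chapter can be sketched as below for a one-dimensional signal. It is an illustration only: the names compress and decompress and the stored { index, value } layout are assumptions, and a full sort is used for brevity where a heap-based selection would do less work.

```javascript
// Unscaled in-place WHT. Assumption: vec.length is an integer power of 2.
function wht(vec) {
  const n = vec.length;
  for (let h = 1; h < n; h *= 2) {
    for (let i = 0; i < n; i += 2 * h) {
      for (let j = i; j < i + h; j++) {
        const a = vec[j];
        const b = vec[j + h];
        vec[j] = a + b;
        vec[j + h] = a - b;
      }
    }
  }
  return vec;
}

// Keep the k highest magnitude components along with their locations.
function compress(signal, k) {
  const coeffs = wht(Array.from(signal));
  const order = coeffs
    .map((_, i) => i)
    .sort((p, q) => Math.abs(coeffs[q]) - Math.abs(coeffs[p]));
  const kept = order.slice(0, k).map(i => ({ index: i, value: coeffs[i] }));
  return { n: signal.length, kept };
}

// Reinsert the kept data into a zeroed array and transform again.
function decompress({ n, kept }) {
  const coeffs = new Array(n).fill(0);
  for (const { index, value } of kept) {
    coeffs[index] = value;
  }
  return wht(coeffs).map(value => value / n); // undo the scaling
}

const packed = compress([5, 5, 5, 5, 1, 1, 1, 1], 2);
console.log(decompress(packed)); // [ 5, 5, 5, 5, 1, 1, 1, 1 ]
```

The example signal matches two basis patterns exactly, so two stored components are enough to rebuild it without error.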


You may also use the intermediate calculations of the WHT for image
compression as they offer a mixed encoding of position and pattern similar to
wavelets. You normally use the out-of-place WHT algorithm with a scaling factor
of 1/√2 at each stage to keep the vector length constant throughout. One
compression scheme is to select the highest magnitude component (storing its
value and its location) then remove it from the intermediate calculation and
recalculate everything from that change. You repeat the process as many times as
you want. To decompress simply add in the stored values at their correct
locations in the intermediate calculations of a decompress WHT. That is quite a
complex scheme though.


Chapter 12. Random Projections. 
Random projection (RP) is a simple technique with a number of applications in 
machine learning. It is a highly disordered linear mapping of one space to another. 


It preserves the distance between the original and the projected data points with 
high probability. 


It is often used for unbiased dimensional decrease or increase with no attempt to 
tune the response to any particular features of the input data. 


The WHT produces an exceptional output distribution in 2 cases: 


Natural data like images match well with the sequency patterns of the WHT 
allowing a spectral response. Most of the information in an image is concentrated 
into a few dimensions when transformed, which can be picked out for data 
compression. 


Very sparse inputs also produce exceptional outputs in the form of visible 
summations of sequency patterns. Due to the point to sequency pattern behavior 
of the WHT. 


However, those are exceptional responses to a small sub-space of the possible 
inputs to a WHT. 


Almost everything else will not match in a discernible way with any of the 
sequency patterns. Such inputs then are effectively random variables and will 
produce Normally distributed noise when transformed by the WHT. 


One way then to create a random projection is to disorder the Walsh Hadamard
transform by multiplying it with a diagonal matrix containing random +1, -1
entries.











The WHT matrix H:

+1 +1 +1 +1 +1 +1 +1 +1
+1 -1 +1 -1 +1 -1 +1 -1
+1 +1 -1 -1 +1 +1 -1 -1
+1 -1 -1 +1 +1 -1 -1 +1
+1 +1 +1 +1 -1 -1 -1 -1
+1 -1 +1 -1 -1 +1 -1 +1
+1 +1 -1 -1 -1 -1 +1 +1
+1 -1 -1 +1 -1 +1 +1 -1

multiplied by the diagonal matrix of random signs:

+1  0  0  0  0  0  0  0
 0 +1  0  0  0  0  0  0
 0  0 +1  0  0  0  0  0
 0  0  0 -1  0  0  0  0
 0  0  0  0 -1  0  0  0
 0  0  0  0  0 -1  0  0
 0  0  0  0  0  0 -1  0
 0  0  0  0  0  0  0 +1

gives the disordered matrix:

+1 +1 +1 -1 -1 -1 -1 +1
+1 -1 +1 +1 -1 +1 -1 -1
+1 +1 -1 +1 -1 -1 +1 -1
+1 -1 -1 -1 -1 +1 +1 +1
+1 +1 +1 -1 +1 +1 +1 -1
+1 -1 +1 +1 +1 -1 +1 +1
+1 +1 -1 +1 +1 +1 -1 +1
+1 -1 -1 -1 +1 -1 -1 -1





Figure 13. Disordering the WHT. 


A simpler way to do exactly the same thing is to apply a fixed randomly chosen 
pattern of sign flips (+-) to the input data before a WHT. 


Certainly, on a large scale that is enough to transform an image into a Normally
distributed form. For sparse input data where most of the input data elements
are zero it is not enough. However, you can repeat the process. The first RP
(disordered WHT) transforms the sparse inputs into a non-sparse summation of
some sequency patterns. A further disordered WHT will scramble those
sequency patterns and transform them into Normally distributed data.


You can use random permutation as an alternative to random sign flipping if the 
input data has zero mean or you can use it together with sign flipping. 


The WHT, random sign flipping and random permutation are all invertible, hence 
random projections based on those are invertible: 


Random sign flipping and random permutation leave vector length unchanged. 
Random sign flipping changes the mean of the data it is applied to. 
Random permutation does not change the mean of the data it is applied to. 
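
A sign-flip random projection and its inverse can be sketched as follows. This is a minimal illustration using the normalized transform; the function names randomSigns, randomProjection and invertRandomProjection are assumptions made here.

```javascript
// Normalized WHT (scaled by 1/sqrt(n)). Returns a new array.
// Assumption: input.length is an integer power of 2.
function whtNormalized(input) {
  const vec = Array.from(input);
  const n = vec.length;
  for (let h = 1; h < n; h *= 2) {
    for (let i = 0; i < n; i += 2 * h) {
      for (let j = i; j < i + h; j++) {
        const a = vec[j];
        const b = vec[j + h];
        vec[j] = a + b;
        vec[j + h] = a - b;
      }
    }
  }
  const scale = 1 / Math.sqrt(n);
  return vec.map(value => value * scale);
}

// A fixed, randomly chosen pattern of +1 and -1 entries.
function randomSigns(n, rng = Math.random) {
  return Array.from({ length: n }, () => (rng() < 0.5 ? -1 : 1));
}

// Random projection: sign flips followed by a normalized WHT.
function randomProjection(x, signs) {
  return whtNormalized(x.map((value, i) => value * signs[i]));
}

// Inverse: normalized WHT followed by the same sign flips.
function invertRandomProjection(y, signs) {
  return whtNormalized(y).map((value, i) => value * signs[i]);
}

const signs = randomSigns(8);
const projected = randomProjection([1, 0, 0, 0, 0, 0, 0, 0], signs);
console.log(projected);
console.log(invertRandomProjection(projected, signs)); // close to the original input
```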


To reduce the dimensionality of some input data you calculate a random
projection and sample the number of dimensions you require; it doesn’t matter
which you choose.


To increase the dimensionality of some input data, reapply the RP multiple times 
with different randomly chosen patterns of sign flips. 


Chapter 13. Random Projection Applications. 
Applications:


Data hiding and security applications.

Dimension reduction for nearest neighbor search.

Clustering.

Random projection trees.

Compressive sensing.

Locality sensitive hashing.

Data reservoirs.

Filtering.

Evolutionary algorithms.

Neural network preprocessing and post-processing.


Looking at the lesser-known applications:


Compressive sensing.


If you use a random projection for dimension reduction you can reconstruct the
original data by inverting the RP. By construction RPs do not concentrate
information, they spread it out in a fair way among all the output vector
elements. Depending on the degree of dimension reduction the reconstruction
could be quite poor. However, as was said before, natural data has a lot of
preexisting structure, such as being relatively unchanged by moderate smoothing.


Say the natural data is in the coherent domain and the randomly projected
data is in the incoherent domain. Starting in the incoherent domain you can ‘fix’ a
zeroed vector by setting the subset of vector elements in common with the RP
dimension reduction to the corresponding RP values, and leaving all the other
elements alone. Then invert to the coherent domain and smooth with a binomial
filter, which would leave a true image relatively unchanged. Transform to the
incoherent domain and fix again. Repeat many times.





Figure 14. Original image (65536 pixels), inverse from random projection dimension reduction
to 10000 points, reconstruction by smoothing from the same random projection dimension
reduction.


Locality sensitive hashing. 


A Locality Sensitive Hash (LSH) is a hashing algorithm that is not as drastic as a full
hash. The binary output vectors produced by input vectors that are close to each
other differ in only a small number of bits.


You can convert (binarize) the vector elements of a random projection to 0 and 1
values depending on whether an element is less than zero or greater than zero, to
create an LSH.


For some applications -1 and 1 output vector elements are preferred. 
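
A minimal sketch of such a hash is given below, using a sign-flip pattern followed by a WHT and then a threshold at zero. The function names are hypothetical, and mapping an element that is exactly zero to 1 is an arbitrary choice made for this illustration.

```javascript
// Unscaled in-place WHT. Assumption: vec.length is an integer power of 2.
function whtInPlace(vec) {
  const n = vec.length;
  for (let h = 1; h < n; h *= 2) {
    for (let i = 0; i < n; i += 2 * h) {
      for (let j = i; j < i + h; j++) {
        const a = vec[j];
        const b = vec[j + h];
        vec[j] = a + b;
        vec[j + h] = a - b;
      }
    }
  }
  return vec;
}

// Locality sensitive hash: sign flips, WHT, then binarize to 0 and 1.
// `signs` is a fixed pattern of +1 and -1 entries chosen once at random.
function lshBits(x, signs) {
  const v = x.map((value, i) => value * signs[i]);
  whtInPlace(v);
  return v.map(value => (value < 0 ? 0 : 1));
}

// Number of bit positions in which two hashes differ.
function hammingDistance(a, b) {
  return a.reduce((count, bit, i) => count + (bit !== b[i] ? 1 : 0), 0);
}

const signs = [1, -1, -1, 1, 1, 1, -1, 1]; // example pattern, fixed for repeatability
const hashA = lshBits([0.3, -1.2, 0.8, 2.0, -0.5, 0.1, 1.7, -0.9], signs);
const hashB = lshBits([0.3, -1.2, 0.8, 2.1, -0.5, 0.1, 1.7, -0.9], signs);
console.log(hashA, hashB, hammingDistance(hashA, hashB));
```

For -1 and 1 outputs, replace the final mapping with value < 0 ? -1 : 1.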


Data reservoirs and Filtering. 


If you use a normalized transform WHT_N as the basis of a random projection then
vector length is unchanged by the RP, and the output of the RP can be fed back
into its input, producing a dynamic reservoir that acts as a sort of oscillator. You can
add in information at each recurrence along one or more dimensions. Some
dynamic input patterns can cause forced oscillation, where the amplitude (vector
length) builds up quickly.


Evolutionary algorithms. 


If you have a population of 32 (or some other integer power of 2) individuals with,
say, 5 parameters each, then for each of the 5 parameters you can do a special
random projection on the 32 values of that dimension to obtain a population
based Normal distribution with the original mean intact.


First do a random permutation, which leaves the mean unchanged but
randomizes everything else. Then a WHT_N. Since the mean is intact the all +1s
term of the WHT_N will extract that. Then do a random sign flip of all the elements
except the mean containing term. And a further WHT_N. The variation in the
population dimension has been converted to Normally distributed noise and the
mean has been preserved.
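
The four steps can be sketched as below. This is an illustration under stated assumptions: the function name meanPreservingProjection is hypothetical, Math.random is used as the random source, and the mean containing term is taken to be element 0 of the natural order transform.

```javascript
// Normalized in-place WHT (scaled by 1/sqrt(n)).
// Assumption: vec.length is an integer power of 2.
function whtNormalizedInPlace(vec) {
  const n = vec.length;
  for (let h = 1; h < n; h *= 2) {
    for (let i = 0; i < n; i += 2 * h) {
      for (let j = i; j < i + h; j++) {
        const a = vec[j];
        const b = vec[j + h];
        vec[j] = a + b;
        vec[j + h] = a - b;
      }
    }
  }
  const scale = 1 / Math.sqrt(n);
  for (let i = 0; i < n; i++) vec[i] *= scale;
  return vec;
}

// Permute, transform, flip signs (except the mean term), transform again.
function meanPreservingProjection(values, rng = Math.random) {
  const v = Array.from(values);
  // Random permutation (Fisher-Yates shuffle): leaves the mean unchanged.
  for (let i = v.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    [v[i], v[j]] = [v[j], v[i]];
  }
  whtNormalizedInPlace(v); // element 0 now holds the all +1s (mean) term
  for (let i = 1; i < v.length; i++) {
    if (rng() < 0.5) v[i] = -v[i]; // random sign flips, skipping element 0
  }
  return whtNormalizedInPlace(v);
}

const parameter = Array.from({ length: 32 }, (_, i) => 10 + 0.1 * i);
console.log(meanPreservingProjection(parameter));
```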


After all 5 parameters have been processed that way you have generated a new 
population to evaluate. 


Neural network preprocessing and post-processing. 


You can use a RP to convert sparse input vectors to non-sparse vectors as 
preprocessing for a conventional fully connected neural network. Or to increase 
the number of dimensions for a very wide net. 


Using RP preprocessing may in some cases allow you to avoid including bias terms 
in your neural model because the inputs become centered around zero. 


You cannot use RP as preprocessing for convolutional neural networks. The
reason is the response of a RP to a local pattern in the input varies completely
randomly according to its position.


You can use RP as a pseudo-readout layer for neural networks, avoiding all the
weight parameters needed for a conventional readout layer. For outputs that are
expected to be dense, like image data, you might use a WHT_N followed by a fixed
randomly chosen pattern of sign flips. For outputs that are expected to be sparse,
such as in classification tasks, a simple WHT_N on its own might be ideal.


Data means and random projections. 


The elements in a vector supplied to a RP may have a non-zero mean. You can 
choose what to do about that mean: 


You can fully incorporate it into the output of the RP. For example, random sign
flips followed by a WHT_N. And repeat for sparse input data.


You can exclude it by first doing a random permutation, then a WHT_N, then zeroing
the all +1 term of the WHT_N, then random sign flips and a further WHT_N.


You can leave the mean intact. First do a random permutation, then a WHT_N, then
random sign flip the data, excluding the all +1 term of the WHT_N. Then a further
WHT_N.


You can create a self-inverse random projection. If S is a function applying a fixed
randomly chosen pattern of sign flips then S(WHT_N(S(x))) is self-inverse.


S(WHT_N(S(S(WHT_N(S(x)))))) = S(WHT_N(WHT_N(S(x)))) = S(S(x)) = x
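
The self-inverse construction can be checked with a short sketch. The function names are assumptions made here, and S is implemented as an element-wise multiplication by a fixed array of +1 and -1 entries.

```javascript
// Normalized WHT (scaled by 1/sqrt(n)). Returns a new array.
// Assumption: input.length is an integer power of 2.
function whtNormalized(input) {
  const vec = Array.from(input);
  const n = vec.length;
  for (let h = 1; h < n; h *= 2) {
    for (let i = 0; i < n; i += 2 * h) {
      for (let j = i; j < i + h; j++) {
        const a = vec[j];
        const b = vec[j + h];
        vec[j] = a + b;
        vec[j + h] = a - b;
      }
    }
  }
  const scale = 1 / Math.sqrt(n);
  return vec.map(value => value * scale);
}

// S: apply a fixed pattern of sign flips.
function flipSigns(x, signs) {
  return x.map((value, i) => value * signs[i]);
}

// Self-inverse random projection: S, then the normalized WHT, then S again.
function selfInverseProjection(x, signs) {
  return flipSigns(whtNormalized(flipSigns(x, signs)), signs);
}

const signs = [1, -1, 1, 1, -1, -1, 1, -1]; // example pattern, normally chosen at random
const once = selfInverseProjection([1, 2, 3, 4, 5, 6, 7, 8], signs);
const twice = selfInverseProjection(once, signs);
console.log(once);
console.log(twice); // close to [1, 2, 3, 4, 5, 6, 7, 8]
```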


Since WHT_N, random sign flipping and random permutation leave vector length
unchanged, any combination of those effects a pseudo-random rotation of an input
vector from one point on the surface of a hyper-sphere to another. The point to
pattern behavior of the WHT_N allows for meaningful combinatorial behavior.


For some applications the Normally distributed output of random projections is
not ideal and the uniform distribution would be preferred. If you arrange for the
output elements of the RP to be in standard normal form with mean 0 and
variance 1 you can apply the Normal cumulative distribution function to the elements
to get uniforms between 0 and 1.


You can alternatively apply the function atan2(x1, x2) to pairs of elements to get a
uniform between -Pi and Pi. There is no need in that case for the variance to be
constrained to 1.


Chapter 14. Associative memory. 

What are the requirements to effectively store information in a dot product and 
weight vector combination? Answering that allows you to set up the necessary 
pre-processing to create a general form of associative memory. 


The dot product with noise. 


Consider an input vector x with equal independent noise terms n_e (with
variance 1) added to each input element (as a test of noise sensitivity) and a
weight vector w.


If you want to get the value 1 from the dot product x·w you can make one input
vector element 1 and the corresponding weight vector element 1.


x = [0+n_e, 0+n_e, 1+n_e, 0+n_e]
w = [0, 0, 1, 0]
x·w = (0+n_e)·0 + (0+n_e)·0 + (1+n_e)·1 + (0+n_e)·0 = 1 + n_e


You have cut away most of the noise terms and are left with one noise term with 
variance 1 and standard deviation 1. 


As an alternative you can make the elements of x all equal to 1 and all the weights 
equal to 1/n, where n is the dimension of the vectors. 


x = [1+n_e, 1+n_e, 1+n_e, 1+n_e]
w = [1/4, 1/4, 1/4, 1/4]
x·w = (1+n_e)·1/4 + (1+n_e)·1/4 + (1+n_e)·1/4 + (1+n_e)·1/4 = 1 + n_s


By the variance equation for linear combinations of random variables the variance
of n_s is 4×(1/4)² = 1/4 and the standard deviation is 1/2.
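
The two cases can be compared with a few lines of code that apply the variance equation directly. The function name outputNoiseVariance is hypothetical, and the noise terms are assumed independent with equal variance.

```javascript
// Variance of the noise at the output of a dot product when independent
// noise of the given variance is added to every input element.
// By the variance equation this is the sum of w_i^2 times the input variance.
function outputNoiseVariance(weights, inputNoiseVariance = 1) {
  return weights.reduce((sum, w) => sum + w * w * inputNoiseVariance, 0);
}

const cutting = outputNoiseVariance([0, 0, 1, 0]);
const averaging = outputNoiseVariance([1 / 4, 1 / 4, 1 / 4, 1 / 4]);

console.log(cutting, Math.sqrt(cutting));     // 1 1
console.log(averaging, Math.sqrt(averaging)); // 0.25 0.5
```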


Which is to say averaging is better than cutting and that non-sparse inputs to a 
dot product associative memory are preferred. 


In both cases the angle between the input vector and the weight vector is zero 
(ignoring the noise terms.) Suppose you were to increase (by any means) the 
angle toward 90 degrees, then the length of the weight vector must increase to 
keep getting the value 1 out of the dot product since it is diminishing toward zero 
at exactly 90 degrees anyway. Then you find the output of the dot product 
becomes extremely noisy. 


The output of the dot product becomes extremely sensitive to noise in the input 
vector elements at angles approaching 90 degrees. 


Hence, small dot product angles are preferred for a dot product associative 
memory. 


In fact, there is a certain zone where the inputs and weights are distributed 
enough (non-sparse) and the angle small enough where you get less noise out 
than occurs on a single input element. 


What happens as you store multiple items of information into a dot product as 
< vector, scalar > associations, for example < [1, 1, 1, 1], 1>? 


If you store one < vector, scalar > association and the training algorithm is any 
good then the angle between the input vector and the weight vector will be zero. 


If you store 2 < vector, scalar > associations then almost certainly the angles to 
the weight vector won't be zero. 


As you store more and more associations the average angle between the input 
vectors and weight vector must increase. To keep getting the wanted scalar 
values out with the increasing angles required, the length of the weight vector 
must increase. The sensitivity to noise in the input vectors must increase 
accordingly. At some point you cross the boundary from a contractive mapping to 
an expansive one. 


The associative memory learning process involves solving the system of linear 
equations the < vector, scalar > associations represent. Ideally then the input 
vectors used in the < vector, scalar > associations should be linearly independent 
to achieve the maximum (zero error) storage capacity, which is the dimension of 
the dot product. 


The pre-processing requirements for a dot product with a weight vector to act as 
a general associative memory are: 


To convert any sparse input into a non-sparse form. (error correction.) 
To produce vectors that are independent. (capacity.) 
To generalize outside the training set. (usefulness.) 


A Locality Sensitive Hash (LSH) algorithm with +1, -1 binary terms is a reasonable 
solution to those requirements: 


Such a LSH converts any input into a dense pattern of +1 and -1 terms. 


Random vectors (produced by a hash algorithm) are approximately orthogonal in 
higher dimensional space. And even with a LSH you have a high probability of 
getting close to the maximum capacity. 


A small change in the input to a LSH only flips a small number of bits in the output 
vector. Meaning you will get reasonable nearby generalization, especially when 
the associative memory is used in an under-capacity, noise reduction mode. 


The recall algorithm given an input vector is: 


Calculate the Locality Sensitive Hash of the input vector. 


Multiply each +1, -1 bit of the hash vector by a weight and sum (dot product the 
LSH vector and the weights.) 


Return the sum. 


The online training algorithm is given a <vector, scalar> pair to learn, where x 
is the scalar: 


Recall using the input vector and calculate the error (x - recall.) 


Divide the error by the dimension of the dot product. 


For each weight, multiply the divided error by the corresponding +1, -1 LSH bit and 
add to the weight. 


This process makes the recall error zero for the <vector, scalar> pair being stored. 
The correction is not likely strongly correlated with the hash pattern of any prior 
stored pair. The effect then of the current store is to add a small amount of 
Normally distributed noise to all previously stored <vector, scalar> associations. 


From the perspective of the prior stored pairs the new correction is not correlated 
and will largely cancel out by averaging, leaving a small residue of Normal noise. 


If the number of stored <vector, scalar> associations is within the capacity of the 
system you can drive the recall errors to zero by repeated retraining. Which is to 
effectively solve the system of linear equations involved in an online way. 
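

The recall and training steps can be sketched for the scalar case as below. This 
is only an illustration: the class name ScalarMemory is made up, and the +1, -1 
hash patterns are written out by hand rather than produced by a real LSH. The 2 
patterns are deliberately correlated so that repeated retraining is needed. 

```javascript
// Minimal scalar associative memory over ready-made +1, -1 hash vectors.
// The hash itself is assumed to come from elsewhere (for example a random
// projection followed by taking signs); here the bit patterns are given.
class ScalarMemory {
  constructor(dimension) {
    this.weights = new Float64Array(dimension);
  }

  // Dot product of the hash bits with the weights.
  recall(hashBits) {
    let sum = 0;
    for (let i = 0; i < this.weights.length; i++) {
      sum += hashBits[i] * this.weights[i];
    }
    return sum;
  }

  // Spread the recall error equally over all the weights.
  train(hashBits, target) {
    const step = (target - this.recall(hashBits)) / this.weights.length;
    for (let i = 0; i < this.weights.length; i++) {
      this.weights[i] += step * hashBits[i];
    }
  }
}

const memory = new ScalarMemory(8);
const hashA = [1, -1, 1, 1, -1, -1, 1, -1];
const hashB = [1, -1, -1, 1, 1, -1, 1, -1]; // Agrees with hashA in 6 of 8 bits
for (let pass = 0; pass < 50; pass++) { // Repeated retraining
  memory.train(hashA, 3.0);
  memory.train(hashB, -1.5);
}
console.log(memory.recall(hashA), memory.recall(hashB)); // Close to 3 and -1.5
```

A single call to train makes the recall exact for that pair, at the cost of 
disturbing the other one a little. Alternating between the pairs shrinks both 
errors toward zero. 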


You can use the dimensionality increase capacity of Walsh Hadamard based 
random projections to make the system capacity anything you like. 


You can also further use the LSH to switch between different blocks of weights (in 
various ways) to increase capacity while keeping the compute cost reasonable. 


You can use soft binarization for the LSH. The signed square root function 


f(x) = sqrt(x), x >= 0 
f(x) = -sqrt(-x), x < 0 


has attractor points at +1 and -1 for example. Using that needs mathematical 
care to ensure numeric stability. 
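

A small sketch of the signed square root and its pull toward +1 and -1. The 
function names are illustrative only. 

```javascript
// Signed square root: keeps the sign, takes the square root of the magnitude.
function signedSqrt(x) {
  return x >= 0 ? Math.sqrt(x) : -Math.sqrt(-x);
}

// Repeated application pulls any nonzero value toward +1 or -1.
function iterate(x, times) {
  for (let i = 0; i < times; i++) {
    x = signedSqrt(x);
  }
  return x;
}

console.log(signedSqrt(4), signedSqrt(-9)); // 2 -3
console.log(iterate(16, 20), iterate(-0.01, 20)); // Close to 1 and -1
```

Zero stays at zero, and the slope of the function becomes very steep close to 
zero, which is one place where care is needed. 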


It is easy to do many <vector, scalar> operations at the same time, making 
<vector, vector> associative memory possible: 


JavaScript code: Appendix D. 


Used under capacity such associative memory has noise reduction (error 
correcting) properties. Though not as useful as might be supposed because the 
error correction is highly biased in the direction of the weight vector(s), which in 
the <vector, vector> case is a sort of composite of all the stored pairs. 


One solution to improve error correction in the <vector, vector> case is to create a 
random vector r_i for each vector pair to be stored. E.g., store <a_i, r_i> and then 
store <r_i, b_i>. You can then use a_i to recall r_i and use r_i to recall b_i. Since 
the r_i's are random, their composition lacks bias. The random vector need not be of 
the same dimensionality as a_i or b_i. It is also possible to use error correcting 
codes as intermediates. 


Understanding how information is stored using dot products and weight vectors is 
also important for neural networks. The more information you store the greater 
the average angle between the input vectors and the weight vector, which results 
in greater sensitivity to noise in the inputs. There is also a zero-angle attack 
where you input a vector pointing in the same direction as the weight vector to 
get an extreme response. Similar to supernormal stimuli effects seen in biological 
neural networks. 


Chapter 15. ReLU as a switch and switched weighted sums. 
The ReLU activation function used in artificial neural networks can be viewed as a 
switch. 


ReLU is defined as: 
f(x) = x, x > 0 
f(x) = 0, x <= 0 
Figure 15. 


You can view f(x) = x as connect and f(x) = 0 as disconnect. There is both a binary 
and a continuous aspect to ReLU. For example, a light switch in your house is 
binary (on or off) yet connects and disconnects a continuously variable AC voltage. 


You are free to view the switching action and the switching decision (the 
predicate x<=0) separately. 
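

One way to picture the separation, as a sketch with made-up function names: the 
decision is a predicate on a value and the action connects or disconnects a 
signal. 

```javascript
// Switching decision: true means connect, false means disconnect.
function switchDecision(x) {
  return x > 0;
}

// Switching action: pass the signal through or output zero.
function switchAction(connected, signal) {
  return connected ? signal : 0;
}

// ReLU is the case where the signal decides about itself.
function relu(x) {
  return switchAction(switchDecision(x), x);
}

console.log(relu(3), relu(-2)); // 3 0
// The decision can also come from a different value than the one switched.
console.log(switchAction(switchDecision(-1), 5)); // 0
```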


A conventional fully connected ReLU neural network then, is a switched 
composition of weighted sums. For a particular input to such a neural network 
the switch states are decided and become known layer after layer. At the end, all 
the switch states are known and the neural network is a complicated switched 
composition of weighted sums. Which actually is a linear system as long as the 
switch states stay frozen. It is also subject to simplification. 


If you are given the weighted sums: 
u = 2.x + 1.y + 3.z 
v = 4.x + 2.y + 2.z 
w = 1.x + 1.y + 1.z 


And a composition of those: 
t = 2.u + 0.v + 1.w 


Then the composition t can be simplified to: 
t = 5.x + 3.y + 7.z 


Where perhaps v was removed from the composition t by a ReLU switch 
disconnecting it. 
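

The simplification is just arithmetic on the coefficients. A minimal sketch, 
where the function name compose and the array layout are choices made for the 
example: 

```javascript
// Each weighted sum is stored as its coefficients of x, y and z.
const u = [2, 1, 3];
const v = [4, 2, 2];
const w = [1, 1, 1];

// Combine several weighted sums, using one outer coefficient per sum,
// into a single weighted sum of the original inputs.
function compose(outer, sums) {
  const result = new Array(sums[0].length).fill(0);
  for (let k = 0; k < sums.length; k++) {
    for (let i = 0; i < result.length; i++) {
      result[i] += outer[k] * sums[k][i];
    }
  }
  return result;
}

const t = compose([2, 0, 1], [u, v, w]);
console.log(t); // [5, 3, 7]
```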


When all the switch states become known, a fully connected conventional neural 
network collapses through simplification to a square matrix. There is a simple 
linear mapping from the input vector to the output vector. And each neuron’s 
value (before its ReLU) is a simple weighted combination of the input terms. 
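

A small sketch of the collapse for one hidden layer. The weights and the input 
are arbitrary numbers picked for the illustration, and the helper names are 
hypothetical. 

```javascript
// Matrix times vector.
function matVec(m, v) {
  return m.map(row => row.reduce((sum, w, i) => sum + w * v[i], 0));
}

// Matrix times matrix.
function matMul(a, b) {
  return a.map(row =>
    b[0].map((_, col) => row.reduce((sum, w, k) => sum + w * b[k][col], 0)));
}

// Illustrative weights for a 2 wide network with one hidden ReLU layer.
const w1 = [[1, -2], [3, 1]];
const w2 = [[2, 1], [-1, 4]];
const input = [1, 2];

// Forward pass, recording the switch state of each hidden ReLU.
const hidden = matVec(w1, input); // [-3, 5]
const switches = hidden.map(h => (h > 0 ? 1 : 0)); // [0, 1]
const output = matVec(w2, hidden.map((h, i) => h * switches[i])); // [5, 20]

// With the switch states frozen, zero the disconnected rows of w1 and
// multiply the matrices together to get a single matrix.
const switchedW1 = w1.map((row, i) => row.map(w => w * switches[i]));
const collapsed = matMul(w2, switchedW1); // [[3, 1], [12, 4]]
console.log(matVec(collapsed, input), output); // Both [5, 20]
```

The collapsed matrix is only valid for inputs that produce the same switch 
states. A different input can give a different matrix. 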


Since switching occurs at zero, a gradual change to the input of a ReLU neural 
network never produces a discontinuous change in the output. It is not 
absolutely essential that switching occurs at zero but it is probably helpful. 


Parametric activation functions for neural networks are of some interest. For 
example, the Two-Slope Parametric (TSP) function: 
f(x) = a.x, x >= 0 
f(x) = b.x, x < 0 
Figure 16. 


If a= 1 and b =0 you have ReLU again. If a=1 and b = 1 you have f(x)=x for all x, 
which is a zero-curvature response and is entirely linear. Weighted sums 
switched by the Two-Slope Parametric function will be modulated in their entirety 
by a and b. 
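

As a sketch, assuming the form f(x) = a.x for x >= 0 and f(x) = b.x for x < 0 
(which agrees with the special cases above and with the code in Appendix E): 

```javascript
// Two-Slope Parametric function: slope a for x >= 0, slope b for x < 0.
// This form is an assumption taken from the special cases in the text.
function tsp(x, a, b) {
  return x >= 0 ? a * x : b * x;
}

console.log(tsp(3, 1, 0), tsp(-2, 1, 0)); // ReLU: 3 and 0 (printed as -0)
console.log(tsp(3, 1, 1), tsp(-2, 1, 1)); // Linear: 3 -2
console.log(tsp(3, 0.5, 2), tsp(-2, 0.5, 2)); // Two different slopes: 1.5 -4
```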


Chapter 16. Fast Transform Fixed-Filter-Bank neural networks. 
The fast Walsh Hadamard transform can be viewed as a collection of dot products 
involving fixed weight vectors (of patterns of +1 and -1) and some input vector. 


In contrast, in a conventional neural network the weight vectors are adjustable. 
Otherwise, the 2 are interchangeable in a neural network layer. 


For a single layer it is just a question of swapping one type of weight matrix (an 
adjustable matrix) with another type, a H matrix (not adjustable and can be 
enacted with a WHT algorithm.) 


-1.56 -2.51 +1.01 +2.26 
-4.67 +1.14 +0.70 -0.95 
+1.93 +3.40 +2.56 +2.43 
+2.45 -0.87 -3.32 +1.01 


+1 +1 +1 +1 
+1 -1 +1 -1 
+1 +1 -1 -1 
+1 -1 -1 +1 


Figure 17. Upper: A matrix of adjustable weights for a single layer of neural network. One row 
per neuron. Lower: An H matrix of fixed weights. 
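

To see that the fast transform and the 4-point H matrix give the same dot 
products, here is a minimal sketch. It uses a compact in-place WHT written for 
the example rather than the Appendix B code. 

```javascript
// In-place fast Walsh Hadamard transform (length must be a power of 2).
function wht(vec) {
  for (let hs = 1; hs < vec.length; hs *= 2) {
    for (let i = 0; i < vec.length; i += 2 * hs) {
      for (let j = i; j < i + hs; j++) {
        const a = vec[j];
        const b = vec[j + hs];
        vec[j] = a + b;
        vec[j + hs] = a - b;
      }
    }
  }
}

// The 4-point H matrix: one fixed weight vector per row.
const H = [
  [1, 1, 1, 1],
  [1, -1, 1, -1],
  [1, 1, -1, -1],
  [1, -1, -1, 1],
];

const data = [1, 2, 3, 4];
// One dot product per row of H.
const byMatrix = H.map(row => row.reduce((sum, h, i) => sum + h * data[i], 0));
const byFast = data.slice();
wht(byFast);
console.log(byMatrix, byFast); // Both [10, -2, -4, 0]
```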


Using ReLU activation functions with H matrices the resultant neural network is 
completely frozen. It has some behavior but there is nothing you can adjust. 


Using Two-Slope Parametric activation functions (TSPs) in place of ReLU you can 
switch modulate each entire H matrix row with its own specific TSP. 


That ultimately results (over a number of layers) in very similar switched 
compositions of weighted sums to a conventional ReLU neural network. 


A comparison per layer where n is the width of the neural network: 


Conventional fully connected ReLU neural network: 
Computational cost: n^2 fused multiply adds. 
Parameters: n^2. 
Switching decisions: n. 


Fast Transform neural network using TSP activation functions: 
Computational cost: n.log2(n) add subtracts and n multiplies. 
Parameters: 2n. 
Switching decisions: n. 


Fast Transform WHT neural networks have a greatly reduced computational cost 
and a reduced number of parameters compared to conventional ReLU neural 
networks. It is interesting that both types of neural network have the same 
number of switching decisions. 
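

For a feel for the numbers, the per layer counts can be tabulated for a given 
width. A sketch only; the function names and the choice of n = 1024 are made up 
for the illustration. 

```javascript
// Per layer counts for a conventional fully connected ReLU layer of width n.
function denseLayerCost(n) {
  return { multiplyAdds: n * n, parameters: n * n, switchingDecisions: n };
}

// Per layer counts for a WHT layer with Two-Slope Parametric functions.
// n is assumed to be a power of 2.
function fastTransformLayerCost(n) {
  return {
    addSubtracts: n * Math.log2(n),
    multiplies: n,
    parameters: 2 * n,
    switchingDecisions: n,
  };
}

const dense = denseLayerCost(1024);
const fast = fastTransformLayerCost(1024);
console.log(dense); // 1048576 multiply adds, 1048576 parameters
console.log(fast); // 10240 add subtracts, 1024 multiplies, 2048 parameters
```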


The information storage capacity of dot products with adjustable weighted sums 
means that for conventional networks the switching can be (in a sense) reused a 
number of times. 


In any event both types of neural network can fit arbitrary functions. Though 
obviously there will be some differences in behavior. For example, the Fast 
Transform based networks are more statistical in nature. Each neuron in a 
particular layer has forced +1 or -1 weighted connections to every neuron in the 
prior layer. And responds to the broad statistical patterns of the prior layer 
through the filtering action of its weighted sum. The weighted sum for each 
neuron being orthogonal to the other weighted sums in the layer. 


A technical point is you don’t want the first WHT in a Fast Transform neural 
network taking a spectrum of natural world input data. Which would prevent 
good utilization of switching decisions in the first layer. You can fix that by 
subjecting the input data to a fixed randomly chosen pattern of sign flips before 
the first WHT (=random projection.) Allowing all the switching decisions in the 
first layer to matter. 


You can also use a final WHT as a pseudo-readout layer (for sparse outputs like 
classification.) Or a WHT followed by randomly chosen sign flip for dense outputs 
like images. 


Such a neural network might be: 


Sign flips, WHT, TSP, WHT, TSP, ..., WHT, Sign flips. 


Zero curvature initialization works very well for such networks, where you set all 
the a_i's and b_i's of the Two-Slope Parametric functions t_i(x) to 1. If you 
include the constant scaling factors the network needs in the function parameters 
you can set all the a_i's and b_i's to the required scaling constant c instead. 


With such zero curvature initialization pairs of layers cause no change: 
WHT_N, TSP(1,1), WHT_N, TSP(1,1) = Identity matrix. 
(WHT_N is the normalized WHT and TSP(1,1) has a = b = 1.) 
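

A quick check that a pair of such layers leaves a vector unchanged. This sketch 
folds the scaling constant c = 1/sqrt(n) into the slopes, as Appendix E does, and 
uses a compact WHT written for the example. 

```javascript
// In-place fast Walsh Hadamard transform (length must be a power of 2).
function wht(vec) {
  for (let hs = 1; hs < vec.length; hs *= 2) {
    for (let i = 0; i < vec.length; i += 2 * hs) {
      for (let j = i; j < i + hs; j++) {
        const a = vec[j];
        const b = vec[j + hs];
        vec[j] = a + b;
        vec[j + hs] = a - b;
      }
    }
  }
}

// One layer: WHT then a two-slope function per element. slopes[2 * j] is
// used for negative values and slopes[2 * j + 1] otherwise.
function layer(vec, slopes) {
  wht(vec);
  for (let j = 0; j < vec.length; j++) {
    vec[j] *= vec[j] < 0 ? slopes[2 * j] : slopes[2 * j + 1];
  }
}

const n = 4;
const c = 1 / Math.sqrt(n); // Scaling constant folded into the slopes
const slopes = new Array(2 * n).fill(c); // Both slopes equal everywhere
const signal = [1, 2, 3, 4];
layer(signal, slopes);
layer(signal, slopes);
console.log(signal); // [1, 2, 3, 4]
```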


You can grow an existing network, without changing it by adding pairs of such 
layers. 


Chapter 17. Miscellaneous. 


One problem with random projections is local pattern data produces a completely 
different response depending on where it is in the overall data. A partial solution 
to that is to use sub-random projections. To do that you can use a sub-random 
pattern of sign flips followed by a WHT. For example, the sign flip pattern 
generated by: 


( sin(i * 1.895) < 0 ) ? +1 : -1 where i is an integer from 0 to n. 


The idea is that the sign flips should be as orthogonal as possible to the Walsh 
Hadamard H matrix rows while being somewhat structured. The value 2.032 (in place 
of 1.895) is also reasonable. Doubtless better patterns can be found. 
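

Generating such a pattern takes only a few lines. A sketch, with the function 
name subRandomSignFlips chosen for the example: 

```javascript
// Sub-random sign flip pattern of length n from the sine of i * step.
function subRandomSignFlips(n, step) {
  const flips = new Array(n);
  for (let i = 0; i < n; i++) {
    flips[i] = Math.sin(i * step) < 0 ? 1 : -1;
  }
  return flips;
}

const flips = subRandomSignFlips(8, 1.895);
console.log(flips);

// Apply the pattern to some data before a WHT.
const data = [1, 2, 3, 4, 5, 6, 7, 8];
const flipped = data.map((value, i) => value * flips[i]);
console.log(flipped);
```

Passing 2.032 as the step gives the alternative pattern mentioned above. 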


The WHT needs only patterns of add and subtract circuits. Integer add and 
subtracts only need a small number of transistors each. It should be possible to 
build very power efficient integrated circuits (chips) for the WHT. Extremely low 
power requirement neural networks should be possible. For example, for 
micro-robotics. 


Some online resources: 


Fast Walsh Hadamard CPU library. 
Generating the Gaussian Distribution with the WHT. 
WHT C code and information. 
Neural network based on the WHT and parametric activation functions. 
Neural network based on the WHT and multiple neural layers. 
WHT neural networks written in Java/Processing. 
Activation weight switching. 
Zero curvature initialization of neural networks. 
A frozen neural network. 
WHT - Faster Fast Fourier Transform. 
Fast transforms for sparsity. 
Switch Net 4. 


Appendix A. 


// JavaScript code for the out-of-place Walsh Hadamard transform. 
let x = [1, 2, 3, 4, 5, 6, 7, 8]; // Data to transform 
let out = []; // Temporary out of place array 
out.length = x.length; 
let stages = Math.round(Math.log2(x.length)); // Number of transform stages 
for (let i = 0; i < stages; i++) { 
  let lower = 0; // The start of the lower half of the array 
  let upper = x.length / 2; // The start of the upper half of the array 
  for (let j = 0; j < x.length; j += 2, lower++, upper++) { // pairwise 
    let a = x[j]; 
    let b = x[j + 1]; 
    out[lower] = a + b; // Place the sum term in the lower half 
    out[upper] = a - b; // Place the difference term in the upper half 
  } 
  let t = x; // Swap the x and out arrays 
  x = out; 
  out = t; 
} 
console.log(x); 


Appendix B. 


// JavaScript code for the in-place Walsh Hadamard transform. 
let x = [1, 2, 3, 4, 5, 6, 7, 8]; 
wht(x); 
console.log(x); 


function wht(vec) { 
  const n = vec.length; 
  let hs = 1; // Gap to the next alike term 
  while (hs < n) { 
    let i = 0; // Start position of the array 
    while (i < n) { // While current stage not complete 
      const j = i + hs; // j = limit for end of current block 
      while (i < j) { // Go through the block 
        let a = vec[i]; 
        let b = vec[i + hs]; // Alike term in next block 
        vec[i] = a + b; // Add subtract and put back 
        vec[i + hs] = a - b; 
        i += 1; // Next term in block 
      } 
      i += hs; // Make i point to the next block pair 
    } 
    hs += hs; // Next stage - double the gap 
  } 
} 
Appendix C. 


// Some JavaScript code for WHT image compression. 
// loadimg, imageToVec and presentDecompressed are helper functions not shown here. 
// wht is the in-place transform from Appendix B. 
class Item { // An index, value pair. Declared before it is used. 
  constructor(idx, value) { 
    this.idx = idx; 
    this.value = value; 
  } 
} 


let img; 
let vec = new Float32Array(65536); 
let decompress = new Float32Array(65536); 
let idxVec = []; 
let idxVecSelected; 
let keepBest = 5000; 


loadimg(img); // Load the image 
imageToVec(img, vec); // Convert to a vector of 65536 grey scale values 
wht(vec); // Transform 
for (let i = 0; i < vec.length; i++) { 
  idxVec[i] = new Item(i, vec[i]); // Store as index, value pairs 
} 
idxVec.sort(function (a, b) { 
  return Math.abs(b.value) - Math.abs(a.value); 
}); // Sort by magnitude 
idxVecSelected = idxVec.slice(0, keepBest); // Select for highest magnitude 
let scalingFactor = 1 / vec.length; // Scaling factor for 2 WHTs 
for (let i = 0; i < keepBest; i++) { // Place best into zeroed array 
  decompress[idxVecSelected[i].idx] = scalingFactor * idxVecSelected[i].value; 
} 
wht(decompress); // Transform again to decompress 
presentDecompressed(decompress); // Show the image 




Appendix D. 


// Vector to vector associative memory using a Locality Sensitive Hash (LSH.) 
// vecLen= integer power of 2. e.g. 64, 128, 256 
// density= number of LSH bits used to recall in each vector dimension 
// rSeed= pseudorandom sign flip seed, any integer 
class AM { 
  constructor(vecLen, density, rSeed) { 
    this.vecLen = vecLen; 
    this.density = density; 
    this.rSeed = rSeed; // Seed for pseudorandom sign flips 
    this.weights = new Float32Array(vecLen * density); 
    this.lsh = new Int8Array(vecLen * density); // Store locality sensitive hash. 
    this.workA = new Float32Array(vecLen); 
    this.workB = new Float32Array(vecLen); 
  } 


  train(target, input) { 
    this.recall(this.workB, input); // Recall and store LSH. 
    subtractVec(this.workB, target, this.workB); // Error vector. 
    // Scale correctly before distributing over the weights. 
    scaleVec(this.workB, this.workB, 1.0 / this.density); 
    let wtIdx = 0; // Weight index 
    for (let i = 0; i < this.density; i++) { 
      for (let j = 0; j < this.vecLen; j++) { 
        // Adjust weights to give zero error. 
        this.weights[wtIdx] += this.workB[j] * this.lsh[wtIdx]; 
        wtIdx++; 
      } 
    } 
  } 


  recall(result, input) { 
    copyVec(this.workA, input); 
    zeroVec(result); 
    let rndSeed = this.rSeed; 
    let wtIdx = 0; 
    for (let i = 0; i < this.density; i++) { 
      // Random projections based on different seeds. 
      randomProjection(this.workA, rndSeed++); 
      for (let j = 0; j < this.vecLen; j++) { 
        const sign = this.workA[j] < 0 ? -1 : 1; // Locality Sensitive Hash bit. 
        this.lsh[wtIdx] = sign; // Store to avoid recomputing LSH during training. 
        // Weight each bit in the LSH and sum over density dimensions. 
        result[j] += sign * this.weights[wtIdx++]; 
      } 
    } 
  } 
} 


function randomProjection(vec, rndSeed) { 
  signFlipVec(vec, rndSeed); // Pseudo-random sign flips based on rndSeed. 
  wht(vec); // Fast Walsh Hadamard transform (Appendix B.) 
  scaleVec(vec, vec, 1.0 / Math.sqrt(vec.length)); // Normalize to leave vector length invariant. 
} 


// Pseudo-random sign flip of the elements of vec based on rndSeed. 
function signFlipVec(vec, rndSeed) { 
  for (let i = 0, n = vec.length; i < n; i++) { 
    rndSeed += 0x3c6ef35f; 
    rndSeed *= 0x19660d; 
    rndSeed &= 0xffffffff; 
    if (((rndSeed * 0x9e3779b9) & 0x80000000) === 0) { 
      vec[i] = -vec[i]; 
    } 
  } 
} 


function scaleVec(rVec, xVec, sc) { 
  for (let i = 0, n = rVec.length; i < n; i++) { 
    rVec[i] = xVec[i] * sc; 
  } 
} 


// x-y 
function subtractVec(rVec, xVec, yVec) { 
  for (let i = 0, n = rVec.length; i < n; i++) { 
    rVec[i] = xVec[i] - yVec[i]; 
  } 
} 


function copyVec(rVec, xVec) { 
  for (let i = 0, n = rVec.length; i < n; i++) { 
    rVec[i] = xVec[i]; 
  } 
} 


function zeroVec(x) { 
  for (let i = 0, n = x.length; i < n; i++) { 
    x[i] = 0; 
  } 
} 


Appendix E. 


// Fixed filter bank neural network. 
// Uses the fast (Walsh) Hadamard Transform as a system of fixed 
// weighted sums and parametric activation functions as the 
// adjustable aspect. 
// No training algorithm given here. See: 
// https://editor.p5js.org/congchuatocmaydangyeu7/sketches/8qpZEPHf2 
// Or the online resources. 
// Uses wht from Appendix B. 
class FFBNet { 
  // vecLen must be 2,4,8,16,32..... 
  constructor(vecLen, depth) { 
    this.vecLen = vecLen; 
    this.depth = depth; 
    this.scale = 1.0 / Math.sqrt(vecLen); 
    this.params = new Float32Array(2 * vecLen * depth); 
    this.signFlips = new Float32Array(vecLen); 
    for (let i = 0; i < this.params.length; i++) { 
      this.params[i] = this.scale; // zero curvature initialization 
    } 
    for (let i = 0; i < vecLen; i++) { 
      this.signFlips[i] = Math.sin(i) < 0 ? -1 : 1; // sub-random pattern 
    } 
  } 


  recall(result, input) { 
    let MIN_SQ = 1e-20; // Guards against division by zero 
    // Normalize the input by its root mean square and apply the WHT scaling. 
    let adj = this.scale / Math.sqrt((sumSqVec(input) / this.vecLen) + MIN_SQ); 
    for (let i = 0; i < this.vecLen; i++) { 
      result[i] = input[i] * this.signFlips[i] * adj; 
    } 
    let paIdx = 0; // parameter index 
    for (let i = 0; i < this.depth; i++) { 
      wht(result); 
      for (let j = 0; j < this.vecLen; j++) { 
        // Two-slope function: one slope for negative values, one otherwise. 
        result[j] *= result[j] < 0 ? this.params[paIdx] : this.params[paIdx + 1]; 
        paIdx += 2; 
      } 
    } 
    wht(result); 
  } 
} 


function sumSqVec(vec) { 
  let sum = 0.0; 
  for (let i = 0, n = vec.length; i < n; i++) { 
    sum += vec[i] * vec[i]; 
  } 
  return sum; 
} 